FILE  COPY 


UNCLASSIFIED 


SECURITY  CLASSIFICATION  OF  THIS  PAGE  (Whan  Data  Entarad) 


•  REPORT  DOCUMENTATION  PAGE 

READ  INSTRUCTIONS 

BEFORE  COMPLETING  FORM 

3.  RECIPIENT'S  CATALOG  NUMBER 

«.  TITLE  (and  Subtltla) 

HIGH-PERFORMANCE  BANDED  AND  PROFILE 
EQUATION  SOLVERS  FOR  THE  CRAY-1: 

I.  THE  UNSYMMETRIC  CASE 

5.  TYPE  OF  REPORT  6  PERIOD  COVERED 

Interim 

6.  PERFORMING  ORG.  REPORT  NUMBER 

SEL  #  160 . .  . . 

7.  AUTHORS 

D.  A •  Calahan 

8.  CONTRACT  OR  GRANT  NUMBER^ 

AFOSR  80-0158 

9.  PERFORMING  ORGANIZATION  NAME  AND  AOORESS 

University  of  Michigan 

Dept,  of  Elec.  §  Computer  Engring. 

Ann  Arbor,  MI,  48109 

It.  CONTROLLING  OFFICE  NAME  AND  ADDRESS 

Air  Force  Office  of  Scientific  Research (NM) 
Bolling  AFB,  Washington,  DC,  20332 

12.  REPORT  DATE  1 

February  1.  1982 

13.  NUMBER  OF  PAGES 

4Q 

U.  MONITORING  AGENCY  NAME  &  ADDRESSf//  different  from  Controlling  Office) 

15.  SECURITY  CLASS,  (of  this  report) 

UNCLASSIFIED 

15«.  DECLASSIFICATION/  DOWN  GRADING 
SCHEDULE 

1';.  DISTP  -'.i’iCN  STATEMENT  (ot  t his  Report) 

Approved  for  public  release;  ai^tiibutinn  unlimited 

17.  DISTRIBUTION  STATEMENT  ( ol  tha  abatracl  antarad  In  Block  20,  If  dlllarant  from  Raport) 

16.  SUPPLEMENTARY  NOTES 

19.  KEY  WORDS  (Continue  on  revere*  eide  it  neceeeery  end  identify  by  block  number) 

Sparse  matrices 

Parallel  processing 

Vector  processing 

Linear  algebra 

20.  ABSTRACT  (Continue  on  reveree  eide  If  neceeeery  end  identify  by  b/ocJc  number) 

This  report  describes  algorithms,  performance,  applications,  and 
user  information  associated  with  two  equation-solving  codes  for  the 
CRAY-1:  (1)  Solution  of  a  single  banded  matrix  equation, unsymmetric 
in  value  but  symmetric  in  structure;  (2)  Solution  of  a  single 
profile  matrix  equation,  unsymmetric  in  valus  and  in  structure. 

Both  solvers  assume  that  the  matrix  is  main-memory  resident. 

DD  1473  EDITION  OF  I  NOV  65  IS  OBSOLETE 


_ UNCLASSIFIED _ 

SECURITY  CLASSIFICATION  OF  THIS  PAGE  (Whan  Data  Zntarad) 


High  Performance 
Banded  and  Profile 
Equation-Solvers  for 
the  CRAY-1 


I.  The  Unsymmetric  Case 


D.  A.  Calahan 


Systems  Engineering  Laboratory 
University  of  Michigan 
Ann  Arbor,  Michigan  48109 
February  1,  1982 


SEL  Report  #160 


Sponsored  by  the  Directorate  of  Mathematical 
and  Information  Sciences,  Air  Force 
Office  of  Scientific  Research,  under 
Grant  80-0158 


..  - 
WA'i  *  ;  _ 


-5i.cn 

i  O  H  i-1  -1  ” 


ABSTRACT 


i 

This  report  describes  algorithms,  performance,  applications, 
and  user  information  associated  with  two  equation-solving  codes  for 
the  CRAY-1: 

(1)  Solution  of  a  single  banded  matrix  equation,  unsymmetric 
in  value  but  symmetric  in  structure; 

(2)  Solution  of  a  single  profile  matrix  equation,  unsymmetric 
in  value  and  in  structure. 

Both  solvers  assume  that  the  matrix  is  main-memory  resident.  The 
former  partitions  the  matrix  internally  to  achieve  high  performance. 
The  latter  requires  a  user-supplied  blocking  of  the  LU  structure, 
an  inconvenience  compensated  by  higher  performance  in  solution  of 
finite  difference  and  a  finite  element  grids. 

These  codes  are  available  as  part  of  a  library  of  CAL-coded 
equation-solvers  ,[13} . 

j  PREFACE 

The  mathematical  software  described  herein  is  the  result  of 
experimental  research  on  vector  algorithms  for  the  direct  solution 
of  2-D  finite  difference  and  finite  element  grids.  The  latter  code 
represents  what  is  thought  to  be  the  best  compromise  between  vector- 
izability,  sparsity  exploitation,  and  user  convenience  for  such 

problems  for  the  CRAY-1. 

I  ' 

REVISION  NOTICE 

Pages  18f  were  revised  on  August  15,  1982,  to  include 
discussion  of  blocking  algorithms. 
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I.  Banded  and  General  Sparsity  Solution 
A.  Introduction 

Studies  of  the  direct  solution  of  2-D  finite  element  grids 
have  tended  to  take  one  of  two  directions,  depending  on  the  nature 
of  the  sparsity. 

1.  Band-related  methods.  Solvers  that  recognize  sparsity 
principally  outside  a  band  around  the  diagonal  are  termed 
band-related  solvers.  This  bandwidth  may  vary  -  envelope, 
skyline,  or  profile  solvers  -  and  may  be  implicit  -  as  in 
frontal  methods  where  a  matrix  is  never  fully  formed. 

These  solvers  account  for  an  easy  majority  of  direct  solu¬ 
tion  methods  in  production  codes. 

2.  General  sparsity  methods.  The  early  work  of  George  [5] 
indicated  for  the  first  time  the  possibility  of  reduced 
operation  counts  using  codes  that  permit  an  arbitrary 
sparsity  pattern,  termed  general  sparsity  methods.  This 
spawned  a  number  of  research  studies  on  such  methods. 

It  now  appears  that  general  sparsity  methods,  at  least  when 
applied  to  matrices  sized  to  reside  in  the  memory  of  current  pro¬ 
cessors,  are  difficult  to  vectorize  [2] [6] [7],  Rather,  only  "large- 
scale"  sparsity  patterns  can  be  vectorized  to  achieve  a  high  perfor¬ 
mance.  This  conclusion  is  documented  in  the  following  section. 
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B.  General  Sparse  Solvers 

These  solvers  accept  arbitrary  sparsity  structure  in  column- 
or  row-ordered  form  and  an  associated  pivot  order.  After  a  pre¬ 
processing  step  to  generate  the  LU  fill  structure,  multiple  numer¬ 
ical  solutions  may  then  be  performed  [8] .  Early  studies  of  vector- 
ization  of  these  scalar  algorithms  sought  to  maintain  the  same 
input  data  structure  and  carry  out  the  same  number  of  floating  point 
operations  as  their  scalar  predecessors  [9].  The  imposition  of 
these  two  constraints  was  justifiable  to  establish  a  performance 
standard  that  may  be  achieved  without  alteration  of  common  user  data 
structures  and  without  introduction  of  the  issue  of  trading  off 
floating  point  computation  for  higher  vector  performance.  By  defin¬ 
ing  vectors  within  dense  regions  of  the  LU  structure,  an  average 
vector  length  (2.)  was  defined  [6]  .  It  was  possible  to  establish 
the  vectorizability  of  the  solution  of  finite  element  grids  exploit¬ 
ing  such  density  [6],  Because  l  increases  monotonically  with  grid 
size,  sufficiently  large  2-D  grids  can  always  be  satisfactorily  vec¬ 
torized. 

C.  Block-Oriented  Solvers 

Unfortunately,  as  nxn  2-D  grids  increase,  operation  counts  in- 

3 

crease  at  least  as  0 (n  )  and  direct  solution  methods  become  less  at¬ 
tractive  than  iterative  techniques.  However,  vector  processors  have 
made  the  solution  of  3-D  and  time-dependent  2-D  problems  feasible; 
in  such  cases,  repeated  direct  solution  of  a  moderate-sized  2-D  grid 
often  appears  as  a  computational  kernel  in  a  global  iterative  strategy. 
For  such  moderate-sized  grids,  one  cannot  depend  on  randomly-produced 
density  to  achieve  long  vectors. 
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The  first  concession  to  vectorization  must  be  to  abandon  the 

traditional  general  sparsity  input  data  structures.  Rather,  the 
* 

user  must  assist  in  the  vectorization  process  by  detecting  either 
repeated  [10]  or  dense  [2]  substructures  in  the  matrix. 

For  the  CRAY-1,  vectorized  block  reduction  can  be  performed 
rather  efficiently  [2].  Such  blocks  are  easily  detected  at  the 
local  level  when  many  (r)  unknowns  are  associated  with  each  node 
and/or  a  finite  element  has  a  large  number  of  associated  unknowns  [4]; 
all  operations  can  then  be  visioned  as  occurring  between  rxr  full 
matrices.  The  overall  execution  rate  (in  MFLOPS)  can  often  be  esti¬ 
mated  from  the  rate  of  the  multiply-accumulate  kernel  that  accounts 
for  the  major  part  of  the  computation  in  the  reduction  of  each  pivot 
block.  Since  no  extra  computation  is  performed  in  such  block-oriented 
elimination,  the  solution  time  is  inversely  related  to  this  execution 
rate . 

A  study  of  the  solution  time  with  such  solvers  on  the  CRAY-1 
shows  that  only  part  of  the  speedup  over  scalar  solvers  is  due  to 
vectorization..  A  significant  advantage  also  accrues  from  the  need 
to  address  only  blocks  rather  than  single  elements  of  the  sparse 
matrix,  since  this  processor  is  known  to  be  slow  in  the  indirect  ad¬ 
dressing  mode  (gather/scatter)  associated  with  linked  list  processing. 

D.  Band-Oriented  Solvers 

Even  the  best  coding  on  the  CRAY-1  -acknowledged  to  have  super¬ 
ior  short-vector  performance  -  cannot  achieve  above  20-30  MFLOPS 
with  fewer  than  five  unknowns/node.  To  achieve  rates  in  the  range  of 

100  MFLOPS  requires  the  significantly  larger  dense  substructures  that 
* 

Algorithms  to  prepare  the  structure  for  vectorization  could,  of 
course,  be  considered. 


are  associated  with  inter-nodal  coupling.  Such  coupling  is  usually 


along  a  line  or  adjacent  lines  (multi-line  coupling)  that  yields  a 
banded  matrix  structure.  Indeed,  the  natural  (grid-row)  ordering 
of  irregular  2-D  grids  yields  a  step  banded  structure,  shown  in 
Figure  5(b).  If  one  can  justify  performing  somewhat  extra  computa¬ 
tion  in  the  gaps  between  such  steps,  the  entire  solution  can  be 
performed  with  a  locally-banded  solution  mode. 

It  is  the  conclusion  of  the  experiences  to  be  reported  that 
such  block  profile  matrix  solution  offers  the  best  performance  com¬ 
promise  between  bandsolvers  that  assume  an  absolutely  regular  sparsity 
structure  and  general  sparse  solvers  that  permit  random  structures. 

The  study  of  such  a  "large-scale  sparsity"  methods  is  the  goal  of 
this  research. 

2.  Report  Summary 

To  establish  a  standard  for  performance  comparison  it  was  essent¬ 
ial  to  first  code  an  efficient  bandsolver  in  assembly  language  (CAL) . 
This  was  a  non-trivial  task;  the  memory-hierarchial  CRAY-1  architectur 
required  a  partitioned  solution  process.  The  first  part  of  this  re¬ 
port  describes  the  algorithms,  implementation,  and  performance  of 
this  software.  Its  speedup  over  Fortran  implementations  makes  this 
useful  in  its  own  right. 

The  block  profile  solution  is  then  discussed  and  is  liberally 
documented  with  examples  to  give  insight  into  the  class  of  problems 
for  which  it  yields  improved  performance  over  the  above  standard. 
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Solution  of  a  Single  3anded  Matrix  Equation 


A.  Memory-resident  Banded  Systems 

In  reference  [1] ,  Jordan  has  presented  an  algorithm  for  solving 
a  banded  set  of  equations  on  the  CRAY-1  and  gave  the  performance  of 
associated  CAL  software.  Unfortunately,  the  64-length  vector  limita¬ 
tion  of  the  CRAY-1  resulted  in  the  code  being  applicable  to  matrices 
with  half-bandwidths  b264.  This  part  of  , the  report  describes  the 
design  and  performance  of  software  that  does  not  have  this  restriction 
It  can  be  argued  that  very  large  banded  sets  of  equations  cannot 
be  solved  with  the  entire  matrix  resident  in  main  memory,  an  assump¬ 
tion  of  the  code  to  be  described.  Nonetheless,  an  intermediate  range 
of  matrix  sizes  beyond  the  above  bandwidth  restriction  can  be  stored 
in  even  %-megaword  configurations,  and  yet  solved  in  fractions  of  a 
second.  With  up  to  4-megaword  systems  in  the  offing,  it  is  likely 
that  the  majority  of  banded  matrices  will  be  beyond  the  capability  of 
Jordan's  code. 

Because  most  banded  matrices  arise  from  the  solution  of  partial 
differential  equations,  consider  the  banded  matrix  produced  by  apply¬ 
ing  the  5-point  finite  difference  formula  to  the  2D  grid  of  Figure  1, 
where 

n  is  the  shorter  grid  dimension 
s 

n^  is  the  longer  grid  dimension 
u  is  the  number  of  unknowns  per  grid  point 
k,=n  /n,  is  the  ratio  of  grid  dimensions 


nr9 


n  =5 
s 


u  unknowns/grid  point 


k  =n  /n 
Z  Z  s 


Figure  1.  Definition  of  grid  descriptors 
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Table  1.  b  :n  as  a  function  of  matrix 

max  smax 

storage  (megawords)  and  number  of  unknowns/ 
grid  point;  examples  below  dashed  line  have 


The  matrix  is  of  order  n=ungn^,  with  a  half-bandwidth  b=’ 
The  matrix  storage  in  compressed  form  (Figure  2)  is 

s  =  (2u(ns+l) -1) (ungn£) 

=  (2u(ns+l)-l)  (uk£ns2) 


For  n  >>1, 
s 

2  3 

s  «  2k.u  n 
£  s 

Asymptotically, 


1/3 


n  = 
s 


2k£u 


and  the  half-bandwidth  is 


b 


un 

s 

•794u1/3 (s/k.) 


1/3 


-1. 


(1) 


(2) 


(3) 


With  b<64,  it  is  clear  from  (3)  that  grids  with  a  constant  k^  will 
be  more  impacted  by  this  restriction  as  u  increases.  The  precise 
nature  of  this  restriction  is  indicated  in  Table  1.  For  example, 
a  one-megaword  processor  with  a  grid  dimension  ratio  k^  =  2  will 
accommodate  the  matrix  only  when  u^l ;  yet  the  associated  equations 
can  be  solved  in  only  330  msec  at  100  MFLOPS ,  a  representative  exe¬ 
cution  rate.  Even  a  square  50  x  50  grid  with  u=4,  executing  on  a 
four-megaword  system,  can  be  solved  in  only  8  seconds,  and  yet 


i 

L 
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generates  a  half-bandwidth  of  203,  well  beyond  the  64-length  limita¬ 
tion  . 

In  conclusion,  it  appears  that  many  2D  grids  producing  matrices 
with  bandwidths  greater  than  64  can  be  solved  in  reasonable  time 
with  direct  methods  on  present  and  near-term  memory  configurations. 
The  more  general  program  to  be  described  can  be  expected  to  extend 
the  usefulness  of  many  application  codes  that  are  based  on  direct 
methods  to  such  problems. 


B.  Algorithms  and  Implementation 

1.  Problem  statement  and  solution 

It  is  desired  to  solve  a  banded  set  of  equations 

AX  =  B 

where  A  is  an  n  x  n  unsymmetric  matrix  of  half-bandwidth  m,  and  is 
sufficiently  well-conditioned  that  pivoting  is  not  required. 

The  solution  is  performed  by  partitioned  LU  factorization  of 
A  in  one  subroutine  (BANFAC)  and  by  partitioned  forward  and  back 
substitution  in  a  second  subroutine  (BANSOL) . 

2.  Storage  options 

In  different  applications  disciplines,  it  is  customary  to  store 
the  matrix  in  one  of  two  compressed  formats.  In  row  storage  format, 
illustrated  in  Figure  2(b),  the  diagonals  are  stored  in  rows  of  a 
(2m+l)  x  n  array;  in  column  storage  format,  the  diagonals  are  stored 
in  columns  of  a  n  x  (2m+l)  array.  The  voids  are  assumed  to  be  zero¬ 
valued. 

3.  Inner  loop  algorithm 

The  algorithm  of  Jordan's  inner  loop  code  is  adopted  for  this 
more  general  code,  although  the  coding  is  somewhat  different. 

Jordan's  LU  factorization  uses  the  following  accumulation  for 
the  jth  columns. 

n(k+l)  _  (k)  T 

r+l:s,j  Ur+l:s,j  Urj  r+l:s,r  (5) 


(k+1) 

'  j+1 : t , j 


L  . 

J+1: t, j 


L  j+1 :  t ,  j  urjLj+l:t,r 


where  s=min ( r+m, j ) ,  t=min ( r+m, n) ,  r=r  ,  ...,j-l,  k=r-r0,  r  =max(j-m,l) 
and  U  ,  • (L  ,  .)  represents  a  vector  of  components  u. .(£. .)  with 

a.D,ja:D,j  1313 

asisb. 

Since  the  calculation  of  u^  .  .  is  completed  after  the  kth 

r0 +K ,  J 

step  and  since  the  component  u.  .  (or  1.  .)  is  unaffected  by  the  accumu- 

iJ  ij 

lation  until  isr+m,  the  vector  length  remains  m  until  r+m>n.  After 

each  kth  step,  the  first  vector  element  u  .  or  ?•  ,  .  is  removed 

r0+k,  j  +k  ,  r 
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I 

by  a  vector  shift  and  a  new  final  element  u  ,  , ,  .  or  l  .  added 

-1  r+m+1 ,  j  r+m+l,] 

at  the  end  of  vector, 
rj 

Once  removed  from  the  vector,  the  completed  element  immediately 
becomes  a  scalar  multiplier  in  (5)  and  (6).  This  removal  pro¬ 

cess  creates  a  delay  in  Jordan's  code,  and  a  subsequent  small  loss 
^  in  asymptotic  MFLOP  rate.  The  delay  is  removed  in  the  new  code 

•  by  precalculating  the  first  vector  element  in  scalar  mode.  The 

resulting  accumulation  loop  has  a  timing  formula 

T£  =  17  +  VL  VL>30  (8) 

where  VL  is  the  vector  length;  the  associated  execution  rate  is  126 
MFLOPS  for  VL=64.  For  VL<30,  the  timing  is  approximately  a  constant 
47  clocks. 

4.  Partitioning 

To  extend  Jordan's  code  beyond  half-bandwidths  of  64  while  main¬ 
taining  the  high  execution  rates  associated  with  vector  accumulation 
loops  such  as  the  above,  the  matrix  must  be  partitioned  into  64  x  64 
blocks,  noted  by  Jordan  [1]  for  full  matrices  and  Calahan  [2]  for 
block  sparse  matrices.  In  [2],  the  partitioning  of  banded  matrices 
was  performed  into  square  blocks  and  "bandedge"  blocks,  with  a  de¬ 
gradation  in  processing  the  latter.  In  the  present  code,  the  parti¬ 
tioning  is  performed  into  diagonal  blocks,  as  illustrated  in  Figure 
3.  Loop©  is  a  single  vector  operation;  loop  (2)  is  the  inner  loop 
of  vector  operations  which  terminate  after  64  accumulations  into  the 
jth  column.  In  loop  Q,  the  next  64  elements  of  the  same  64  columns 
are  accumulated  into  the  jth  column.  The  scalar  pre-calculation 
noted  above  for  the  inner  loop  is  unnecessary  for  the  blocks  non- 
ad  jacent  to  the  diagonal,  and  a  somewhat  more  efficient  accumulation 
loop  is  utilized.  Loop  (3)  continues  until  the  bandedge  is  encountered. 
Loop©  then  advances  the  accumulation  to  the  next  column  of  64  blocks, 
as  shown  in  Figure  3. 

When  the  bottom  of  the  matrix  is  encountered  in  the  processing 
of  a  block,  one  is  faced  with  either  testing  for  this  condition  in 
the  inner  loop  --  and  thus  adding  a  fixed  inner  loop  overhead  —  and 
then  reducing  the  vector  length,  or  simply  carrying  out  the  additional 
floating  point  calculations.  It  happens  that  the  matrix  storage  for¬ 
mat  permits  the  latter  during  factorization,  so  that  this  procedure 
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Figure  2.  Band  matrix  storage  formats. 


was  chosen.  The  extra  block  processing  is  indicated  by  dashed  lines 
in  Figure  3.  For  large  n/m  ratios,  the  faction  of  extra  computa¬ 
tion  is  bounded  by  m/3n.  During  the  substitution  steps,  the  ratio 
is  bounded  by  m/2n;  also,  the  right  hand  side  is  relocated  on 
entry  and  exit,  to  provide  requisite  void  storage  space. 


C.  Performance 

In  comparison  with  the  timing  of  Jordan's  unpartitioned  code, 
the  partition.:  -  .  ir.  r  .  :r.ewhat  more  overhead  to  implement, 

the  partitioning  and  to  allow  row-  or  column-ordered  storage.  However, 
use  of  a  simulator  [3]  has  allowed  a  somewhat  more  careful  attention 
to  inner  loop  timing  (see  below) .  This  effort  is  deemed  worthwhile, 
since  the  direct  solution  of  large  dense  matrix  equations  inevitably 
dominates  the  equation  formulation,  with  the  result  that  the  total 
execution  time  tends  to  be  closely  related  to  the  performance  of 
this  inner  loop. 

Table  2  gives  the  performance  statistics*  of  the  partitioned 
versus  the  unpartitioned  code.  As  indicated,  the  specialized  inner 
loop  coding  more  than  compensates  for  the  outer  loop  overhead.  Also 
the  execution  rates  for  even  small  bandwidths  easily  exceed  those 
of  scalar  processors  (in  the  order  of  1-5  MFLOPS) . 

The  effects  of  partitioning  are  shown  in  Figure  4.  For  band- 
widths  less  than  65,  no  partitioning  is  necessary  and  all  vector 
lengths  are  equal  to  the  bandwidth.  For  65smsl28,  the  average 
vector  length  is  approximately  m/2.  A  resultant  sharp  drop  in  execu¬ 
tion  rate  occurs  for  m=6  5.  For  129  m-jl92,  the  average  length  is 
2m/3,  so  that  the  decrease  for  m=129  is  less  severe. 

figure  4  also  presents  the  measured  rates  of  this  software  versus 
the  Fortran-coded  LINPACK  bandsolvers  SGBFA  and  SGBSL  available  at  CRI 
(a  slower  version  of  these  codes  on  the  UCS  system  in  9/81  was  also 
tested) .  Because  these  codes  allow  pivoting  (which  commonly  accounts 
for  25-30%  of  the  solution  time  in  CAL  for  large  matrices),  this  com¬ 
parison  is  somewhat  unfair  to  these  Fortran  codes. 

It  is  perhaps  more  intuitive  to  relate  the  performance  directly 
to  the  grid  from  which  banded  matrices  are  normally  derived.  Table 
3  gives  the  computing  times  and  execution  rates  for  a  number  of  cases 
up  to  the  storage  limits  of  a  megaword  machine.  The  execution  rates 
uniformly  are  in  the  range  of  100  MFLOPS  for  large  problems,  ranging 
up  to  the  limit  118  MFLOPS  for  a  64  x  64  grid  with  one  unknown  per 
grid  point. 

*A11  performance  results  in  this  report  are  derived  from  runs  on 
the  megaword  CRAY-1  configuration  at  United  Computing  Systems,  Inc. 


Half  Bandwidth 

Time  :  Rate 
(ms  :  MFLOPS*) 

Improvement 

Factorization 

8 

1.88:18.2 

2.1 

16 

2.97:43.6 

1.9 

32 

5.66:86.0 

1.7 

64 

16.0:110. 

1.3 

96 

35.9:99.2 

— 

128 

54.6:103. 

— 

Solution 

8 

.315:26.4 

1.3 

16 

.316:51.0 

1.3 

32 

.358:86.1 

1.3 

64 

.565:102 

1.1 

96 

.921:87.1 

— 

128 

1.14:86.2 

— 

*Extra  computation 

ignored  in  operation 

count  (see 

Table  2.  Timing  performance  of  partitioned  code  for 
256  equations,  and  comparison  with  Jordan's 
original  code  [1] . 


EXECUTION  RATE  (MFLOPS) 


2 

4 

i 

1.52:45.4 

6.50:89.0 

] 

12.3:88.2 

115:76.8 

221:96.7 

51.1:106. 

224  :  76 . 2 

445:93.3 

413:107. 

718:114. 

i 

i  _ 

23.6:88.4 

283:118 


(a)  Factorization  (BANFAC) 


.085:28.1 
. 315:51.0 


2 

.162: 

;  50 . 1 

.  720: 

:  91- 4 

2.  07: 

:  107 

7.05; 

:  74 . 6 

11.1: 

:  92. 7 

1.40:93.0 

8.84:118. 


(b)  Solution  (BANSOL) 


.376:89.1 

3.49:76.2 

5.43:95.6 

8.55:105. 

12.7:112. 


Table  3.  Time  (msec):  execution  rate  (MFLOPS)  as  a 
function  of  square  grid  size  (n  )  and 
number  of  unknowns  per  grid  point  (u) . 
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.  Software  Description 

1.  Calling  sequence 

Factorization 

CALL  BANFAC (N , M , A , NDI AG , NDROW ) 

Substitution 

CALL  BANSOL (N, M, A, NDI AG, NDROW, B) 

where 

N  is  the  number  of  equations 

M  is  the  half-bandwidth,  not  including  diagonal 
A  is  the  compressed  band  matrix  array 
NDIAG  is  the  diagonal  addressing  increment 
NDROW  is  the  row  addressing  increment 
B  is  the  right  hand  side  and  the  solution  array 

2.  Explanation 

1.  NDIAG  is  the  storage  addressing  increment  between  logical 
matrix  positions  (i,j)  and  (i+l,j+l),  i.e.  between  successive 
diagonal  elements.  For  row  storage,  NDIAG^2*M+1;  for  column  storage, 
NDIAG=1. 

2.  NDROW  is  the  storage  addressing  increment  between  logical 
matrix  positions  (i,j)  and  (i+l,j),  i.e.  between  successive  column 
elements.  For  row  storage,  NDR0W=1;  for  column  storage,  NDR0W=-(N-1) 

3.  Restrictions 

1.  Storage  must  be  zero-valued  beyond  active  compressed  banded 
storage  (Figure  2)  . 

2.  Storage  for  B  must  be  at  least  N+2*M. 

3.  I NDROW  !  should  not  be  a  multiple  of  8,  to  avoid  bank 

conflicts . 


III.  Solution  of  a  Block  Profile  Matrix  Equation 
A.  Motivation  for  Profile  Solution 

Profile  matrices  tend  to  arise  in  two  ways. 

1.  The  "natural"  column-by-column  (or  row-by-row)  reduction 
of  nodes  in  a  2-D  grid  produces  a  banded  matrix  for  rec¬ 
tangular  grids  only.  For  grids  with  irregular  external 
boundaries,  a  variable  bandwidth  (profile)  matrix  re¬ 
sults  . 

2.  Floating  point  computation  can  be  approximately  halved  in 
solution  of  a  large  grid  defined  by  a  5-point  operator, 
by  first  reducing  unrelated  nodes.  This  step  requires 
insignificant  computation  for  large  grids;  however,  it 
halves  the  matrix  size  and  leaves  the  bandwidth  unchanged. 
If  the  unrelated  nodes  are  eliminated  along  alternate 
diagonals  (termed  AD  or  D4  ordering  [11])  and  then  the 
remaining  nodes  numbered  along  diagonals,  a  profile  matrix 
results.  The  LU  factors  of  such  a  matrix  are  illustrated 
in  Figure  5(b)  for  the  8x12  grid  of  Figure  5(a).  By  ex¬ 
ploiting  this  profile,  floating  point  operations  can  be 
reduced  by  another  factor  of  2  for  a  large  square  grid. 

In  the  examples  of  this  report,  it  is  assumed  that  the  alter¬ 
nate  diagonals  have  been  eliminated  in  the  first  step  and  only  a 
profile  matrix  remains.  On  the  CRAY-1,  the  vectorization  of  this 
first  step  is  highly  dependent  on  the  regularity  of  matrix  storage, 
since  considerable  data  movement  but  little  computation  is  involved. 
With  random  storage  and  scalar  operations  from  FORTRAN,  a  rate  less 
than  1  MFLOP  may  be  achieved.  With  a  patterned  storage,  from  CAL 
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a  rate  of  70  MFLOPS  has  been  observed.  In  the  first  case,  the 
significance  of  the  time  associated  with  this  first  step  can  be 
ignored  only  with  very  large  grids. 

Storage  permitting,  the  profile  matrix  could  be  solved  by 
the  bandsolver  previously  described.  If  the  profile  solver  were 
to  operate  at  the  same  execution  rate  as  the  bandsolver  (an  opti¬ 
mistic  assumption) ,  then  on  a  square  grid  the  solution  time  would 
be  halved.  This  factor  of  2  is  therefore  an  upper  limit  on  speed¬ 
up  of  profile  over  banded  solution.  Note  that  this  is  far  less 
than  the  3-5  speedup  factor  which  CAL  achieves  vis-a-vis  CFT  (Fig¬ 
ure  4)  . 

B.  Block  Profile  Solution 

1.  Blocking  Rationale 

The  vectorization  of  the  solution  is  further  assisted  by  "reg¬ 
ularization"  of  the  matrix  structure  into  two-dimensional  blocks, 
for  two  reasons. 

(1)  Fewer  symbolic  descriptors  relating  to  block  size  and 
storage  locations  are  necessary  to  describe  a  block  than 
to  describe  the  same  matrix  elements  either  individually 
or  as  a  collection  of  1-dimensional  dense  columns  or  rows 
(as  in  [9]).  The  processing  of  these  descriptors  can 

add  significant  overhead  to  numeric  processing,  especially 
when  processing  small  blocks  on  a  vector  processor. 

(2)  The  high  speed  CRAY-1  vector  register  set  has  a  single 
critical  data  path  to  main  memory.  The  utilization  of 
this  path  can  be  reduced  by  performing  matrix-matrix  or 
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Figure  5(a).  AD-ordered  8 
sents  nodes  eliminated  in 
Matrix  shown  in  Figure  5  (b 
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Figure  5  (b )  Blocked  profile  matrix  associated 
with  3x12  grid  with  u  =  1;  G  -  zero-valued  posi¬ 
tion  inserted  for  blocking. 
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(b)  Problem  #2,  23x37,  507  equations 


:)  Problem  #3,  55x72,  2323  equations 
Figure  6.  Irregular  grids 
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(a)  Original  LU  factors  (b)  After  column-ordered 

elimination  and  blocking 

Figure  7.  Northwest  61x61  partition  of  507- 
equation  LU  factors;  with  AD  ordering  after 
elimination  of  alternate  nodes;  O-zero-valued 
positions  added  for  blocking. 


matrix-vector  rather  than  vector-vector  operations. 

The  identification  of  matrix  (or  block)  structures  is 
thus  essential  to  achieving  highest  execution  rates. 

2.  Blocking  Attributes 

To  gain  insight  into  desirable  blocking  attributes,  two 
classes  of  problems  are  studied  in  this  report. 

(a)  The  model  grid  of  Figure  1,  reduced  to  a  profile  matrix 
by  D4  ordering. 

(b)  A  set  of  three  irregular  grids  of  Figure  6,  taken  from 
[14].  Unrelated  nodes  are  pre-reduced,  but  in  the  nat¬ 
ural  ordering  of  the  grid  rather  than  along  diagonals. 

The  profile  of  the  resulting  matrix  is  then  similar  to 
the  profile  of  the  matrix  resulting  from  natural  order¬ 
ing  of  the  entire  grid;  however,  the  number  of  equations 
is  approximately  halved. 

The  structure  of  the  LU  matrix  from  an  AD-ordered  model  grid 
has  been  shown  in  Figure  5 (b) .  For  the  irregular  grid  of  Figure 
6(b),  a  submatrix  is  shown  in  Figure  7(a).  In  both  cases,  the 
natural  matrix  boundaries  occur  not  in  rectangular  blocks  but  in 
blocks  whose  boundaries  are  parallel  to  the  diagonal.  This  charact¬ 
eristic  is  consistent  with  the  partitioning  strategy  of  the  band- 
solver;  this  suggests  that  the  accumulation  kernels  of  the  latter 
may  be  used  in  this  case. 

Unfortunately,  these  accumulation  kernels  require  that  a  column 
of  L  be  considered  dense  from  the  diagonal  to  the  largest  row  number 
(contrast,  an  inner-product  accumulation).  As  a  result,  for  the  ir- 


regular  grid  of  Figure  6  (b) ,  the  columns  of  the  L  factor  are 
extended  to  the  bandedge,  as  shown  in  Figure  7(b).  The  result¬ 
ing  extra  computation  associated  with  extra  non-zeros  will  be 
evaluated  later  empirically.  Non-zeros  need  not  be  added  to  the 
U  matrix  for  this  reason. 

A  second  accumulation  kernel  characteristic  is  the  assumption 
that  successive  columns  being  accumulated  begin  and  end  one  row 
number  apart.  This  observation  sets  the  primary  requirement  of 
the  blocking  algorithm  for  L  and  U,  i.e.,  the  identification  of 
two-dimensional  submatrices,  each  bounded  above  and  below  by  diag¬ 
onals  parallel  to  the  main  diagonal  and  by  columns  on  each  side. 

In  this  blocking  process,  one  can  judiciously  add  non-zeros 
to  complete  blocks,  as  in  Figure  3  near  the  southeast  corner  of 
the  banded  matrix.  This  becomes  the  critical  part  of  the  blocking 
algorithm  and  will  be  considered  in  detail  later.  This  addition  of 
non-zeros  in  L  will  suffice  for  blocking  of  the  factorization  and 
forward  substitution.  However,  if  the  back  substitution  is  to  pro¬ 
ceed  efficiently,  blocks  in  U  must  also  be  completed  as  above. 

Matrices  blocked  in  L  and  U  by  the  algorithm  below  are  illus¬ 
trated  in  Figures  5(b)  and  7(b)  for  model  and  irregular  girds,  re¬ 
spectively. 

3.  Blocking  Algorithm 

An  optimal  blocking  algorithm  must  have  as  its  goal  minimiza¬ 
tion  of  the  factorization  (or  solution)  time  by  reducing  the  number 
of  blocks  without  adding  excessive  computation  or  storage.  Develop 
ment  of  this  algorithm  would  proceed  by 


(a)  coding  the  block  factorization  and  solution  algorithm, 

{b)  developing  a  detailed  timing  model  for  each  of  the 
major  loops  in  the  code,  and 

(c)  solving  for  the  location  of  block  separators  that  mini¬ 
mize  the  execution  time. 

This  problem  can  be  phased  as  a  nonlinear  programming  problem. 

The  nonlinearity  arises  from  the  possibility  of  overlap  between 
scalar  and  vector  operations:  scalar  computation  may  hide  a  con¬ 
current  short  vector  operation,  or  it  may  itself  be  hidden  by  a  long 
vector  operation.  The  dependence  of  computation  time  on  vector  length 
is  therefore  a  nonlinear  one.  Even  without  this  dependence,  the 
arbitrary  insertion  of  block  separators  is  an  integer  programming 
problem  and  so  is  beyond  reasonable  solution  time  for  large  profile 
systems. 

A  local  -  and  thus  suboptimal  -  minimization  algorithm  has  been 
developed  that  focuses  on  the  local  irregularities  of  the  profile. 

It  proceeds  as  follows: 

(a)  In  blocking  L,  the  search  direction  proceeds  from  the 
first  to  the  last  column;  U  is  blocked  from  the  last  to 
the  first  column.  The  following  rules  will  apply  to  the 
more  critical  L  blocking;  similar  rules  are  used  to  block  U. 

(b)  The  locality  of  the  algorithm  is  limited  to  three  successive 

columns,  numbered  ra,  rb  =  r  +1,  and  rc  =  rb+l.  If  the 

associated  (half-)  bandwidths  are  M  ,  M,  ,  and  M  ,  then  the 

a  d  c 

block  is  continued  in  tae  search  direction  if  M.  =  M  . 

b  a 

Otherwise,  depending  c  the  value  of  M  ,  either  a  new 
block  is  initiated  in  the  column  rb  or  non-zeros  are  added 
to  continue  the  presei  block.  This  three-column  algorithm 
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has  the  effect  of  insuring  that  a  block  will  be  at  least 
two  columns  wide.  The  flow-chart  for  the  complete  pro¬ 
cess  is  given  in  Figure  8. 

(c)  The  local  process  of  (b)  is  followed  by  a  more  global 

blocking  step  where  two  blocks  are  merged  into  a  single 
block  if  their  half-bandwidths  differ  by  less  than  a 
preset  value. 

4 .  Storage 

Although  storage  is  considered  a  secondary  issue  to  processing 
speed,  storage  alternatives  deserve  consideration  before  selection 
of  one  for  implementation. 

(a)  If  repeated  solutions  are  required  from  the  same  factor¬ 
ization,  then  the  LU  and  the  matrix  (A)  storage  cannot 
overlap.  If  overlapping  is  permitted,  then  the  LU  stor¬ 
age  may  be  allocated  ahead  of  the  matrix  storage  so  that, 
even  with  fills  and  inserted  non-zeros,  the  LU  storage 
does  not  overtake  the  matrix  storage  during  factorization. 

(b)  One  has  the  choice  of  storing  columns  of  each  block 
contiguously,  or  of  interlacing  L  and  U  storage  in  the 
manner  of  banded  matrix  storage.  When  overlapping  LU 
and  the  matrix  storage  is  permitted,  the  choice  can 
become  critical.  For  example,  if  the  matrix  has  a 
constant  profile  so  that  only  one  block  of  U  and  of 

L  are  defined,  then  contiguous  block  storage  will  require 
that  either  the  entire  L  or  the  entire  U  block  be  stored 
ahead  of  the  matrix.  In  this  worst  case,  total  storage 
will  be  at  least  3/2  of  the  matrix  storage.  On  the  other 


hand,  interlacing  columns  of  L  and  U  must  be  performed 
so  that  each  element  is  directly  addressable  from  a 
block  base  address.  This  will  inevitably  leave  gaps 
in  the  LU  storage  not  present  in  A. 

In  the  blocking  algorithm  to  be  discussed,  the  profile  matrix 
is  stored  overlapped  with  A  in  compressed  banded  matrix  format,  i. 
the  storage  location  of  logical  position  (i,j)  in  LU  is  (i-j  + 

HBW  +1,  j),  where  HBW  is  the  maximum  half-bandwidth  of  L  and  U. 

The  above-mentioned  wasted  storage  is  simply  tolerated.  The  LU 
storage  is  allocated  in  array  B  at  compile  time  by  an  EQUIVALENCE 
statement  of  the  form 

EQUIVALENCE  (A{1),  B (L) ) 

where  the  minimum  value  for  L  is  printed  by  the  blocking  subroutine 
It  is  the  user's  responsibility  to  ensure  that  this  amount  has  been 
set  aside  before  proceeding  with  the  numerical  solution.  When  the 
symbolic  blocking  and  the  numeric  solution  are  carried  out  in  dif¬ 
ferent  run  steps,  this  is  no  problem. 

5.  Input  to  Blocking  Program 

Unsymmetric  profile  solvers  conventionally  assume  that  the 
data  is  stored  by  column  or  row.  Whereas  each  column  (row)  is  as¬ 
sumed  stored  compactly,  adjacent  columns  (rows)  may  not  be.  To 
allow  such  non-compact  storage,  the  following  data  must  be  supplied 
by  the  user.  Note  that  only  storage  by  column  is  permitted. 

(1)  symbolic:  the  first  and  last  row  number  in  each  column; 


(2)  numeric:  the  numeric  storage  location  of  the  diagonal 


elements . 

A  characteristic  often  associated  with  finite  difference 
grids  is  the  numbering  of  the  nodes  without  regard  for  the  number 
of  unknowns  (U)  per  grid  point.  Correspondingly,  it  is  often  con¬ 
venient  to  input  the  above  profile  description  assuming  U=1  and 
then  have  the  blocking  algorithm  expand  the  description  and  per¬ 
form  the  blocking  with  U  as  a  parameter. 

In  summary,  only  the  number  of  equations,  the  number  of  un¬ 
knowns  per  grid  point,  and  the  descriptors  of  (1)  and  (2)  are 
required  inputs  for  the  blocking  algorithm.  The  specifics  of  the 
software  description  are  contained  later  in  the  report. 
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C.  Implementation 

The  following  are  major  considerations  in  the  code  development. 

(a)  The  high-performance  kernels  of  the  bandsolver  are  utilized. 
Therefore,  the  performance  of  the  profile  solver  should 
approach  that  of  the  bandsolver  in  the  special  case  of  a 
large  banded  matrix. 

(b)  In  the  event  blocks  are  larger  than  the  64x64  partitions 
of  the  bandsolver,  the  partitioning  requirement  is  imposed 
during  solution.  This  imposition  of  both  blocking  and 
partitioning  strategies  accounts  for  significant  overhead 
in  loops  above  the  accumulation  kernel.  Another  source  of 
overhead  arises  from  the  provision  for  re-formatting  the 
user-supplied  matrix  storage  to  blocked  LU  form  (see  (d) 
below) . 

(c)  Non-zeroes  inserted  into  (J  to  speedup  the  back  substitution 
are  not  processed  during  the  factorization.  From  Figures 
5(b)  and  7(b),  this  tends  to  affect  an  irregular  grid  more 
than  a  model  grid. 

(d)  From  the  timing  formula  for  the  accumulation  kernel  in 
Equation  (8),  it  may  be  argued  that  extending  blocks  of 

L  to  a  length  of  30  will  not  increase  the  aggregate  kernel 
timing.  Thus,  an  L  block  of  length  30  can  cover  all  ad¬ 
jacent  blocks  of  lenjth  i  30,  and  the  total  overhead  of 
processing  block  descriptors  reduced . 
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D.  Performance 


1.  Timing  Evaluation 

The  common  algorithmic  measure  of  MFLOP  rate  is  not  an  authori 
tative  measure  of  timing  performance  for  this  software  since  extra 
computation  results  from  adding  non-zeros  to  produce  blocks. 
Instead,  a  number  of  model  and  irregular  grids  have  been  solving 
using  the  CRAY-1.  Recall  that  for  D4  ordering  alternate  nodes 
have  been  pre-reduced,  leaving  a  profile  matrix. 

Table  4  gives  a  timing  and  storage  summary  of  these  runs. 

In  each  case,  the  bandsolver  bandwidth  was  chosen  equal  to  the 
maximum  profile  bandwidth.  The  timing  ratios  T_:T  :1  gives  the 

r  C 

relative  computation  times  of  the  Fortran  and  CAL  (respectively) 
bandsolvers  relative  to  the  profile  solver.  The  relative  time 
Tc,  with  a  theoretical  upper  bound  of  2  for  a  model  square  grid, 
is  shown  to  be  less  than  1  for  narrow-band  cases,  and  becomes 
1.76  (the  largest  speedup  or  profile  solution)  for  large  bandwidths 
Indeed,  Table  5  illustrates  the  high  correlation  between  speedup 
and  half-bandwidth  -  whether  arising  from  the  profile  of  a  D4- 
ordered  model  grid  or  the  natural  profile  of  an  irregular  grid. 

The  principal  exceptions  to  this  monotone  behavior  are  the 
elongated  32x128  model  grid  -  for  which  D4  ordering  produces  a 
small  profile  variation  -  and  grid  #1  of  Figure  6(a),  which  also 
has  a  nearly  constant  bandwidth.  In  either  case,  the  overhead  of 
profile  solution  seems  unwarranted. 

2.  Processor  Utilization 

For  the  irregular  grids,  the  floating  point  computations  were 
counted  before  and  after  blocking.  Indeed,  three  counts  were  made 


(Table  6) . 


1.  Original  Count.  This  is  the  count  of  operations  if 
solution  were  performed  on  a  scalar  processor;  this  count 
corresponds  to  the  LU  structure  in  Figure  7 (a) . 

2.  Unblocked.  The  column-by- column  elimination  requires 
dense  columns  of  L  to  the  bandedge,  represented  by  the  x  fill 
in  of  L  in  Figure  7  (b) .  The  resulting  total  operation  count 
is  termed  the  unblocked  count. 

3.  Blocked  Count.  This  count  includes  the  0  fill  n 
L  in  Figure  7  (b) .  The  factorization  count  does  not  include 
the  0  fill  in  U,  but  the  (back)  substitution  count  does 


include  this  fill. 
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Table  6  indicates  the  largest  percentage  operation  count 
increase  for  factorization  due  to  blocking  occurs  with  grid  #2. 

This  occurs  largely  because  the  L  blocks  are  merged  into  a  single 
block  by  the  previously-noted  strategy  of  merging  all  adjacent  L 
blocks  of  length  less  than  30.  Most  blocks  of  grids  #3  are  well  be¬ 
low  the  limit. 

Of  perhaps  more  significance  is  the  effective  MFLOP  rate. 

This  rate  is  obtained  by  dividing  the  number  of  original  operations 
(see  above)  by  the  solution  time;  it  therefore  discounts  the  opera¬ 
tions  due  to  non-zeros  resulting  from  blocking  and  allows  an  intuit¬ 
ive  comparison  with  other  vector  and  with  scalar  processors.  This  rate 
is  shown  to  be  over  80  MFLOPS.  In  contrast,  the  bandsolver  rate  is 
approximately  110  MFLOPS  for  a  half-bandwidth  of  55  (see  Table  4) . 


HBW 

Speedup 

Grid 

< 

u 

7 

.7 

#1 (8x69) 

1 

16 

.78 

16x16 

1  • 

28 

.95 

#2 (23x37) 

1  • 

32 

1.02 

32x32 

i  ; 

34 

.88 

32x128 

l 

39 

1.01 

#1 (8x69) 

5  : 

55 

1.32 

#3 (55x72) 

i 

64 

1.41 

64x64 

i 

71 

1.45 

#1  (8x69) 

9 

86 

1.60 

#2  (23x37) 

3 

98 

1.52 

32x32 

3 

144 

1.76 

#2  (23x37) 

5 

*Maximum  half  bandwidth 

Table  5.  Speedup  of  profile  over  banded 
solution  as  function  of  half  bandwidth. 


J 


NEQ 

Original 

Unblocked 

Blocked 

391 

Fac. 

15,259 

15,469 

17,691 

Sol. 

4,802 

4,838 

5,322 

507 

Fac. 

191,917 

224,853 

281,517 

Sol. 

19,805 

21, 429 

27,599 

2323 

Fac . 

3,754,265 

3,982,156 

4,093,469 

Sol . 

180,846 

187,008 

199,190 

Table  6.  Floating  point  operations  in  solution 
of  irregular  grids.  u  =  1. 


NEQ 

Effective 

Actual 

391 

Fac. 

7.  99 

9.  25 

Sol. 

17.7 

19.6 

507 

Fac . 

34.1 

50.1 

Sol. 

38.2 

53.2 

2323 

Fac . 

75.8 

82.7 

Sol. 

81.8 

90.1 

Table  7.  Effective  and  actual  MrrOP  rates 
in  solution  of  irregular  grids. 
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Software  Description  and  Calling  Conventions 

1.  LU  factorization 

Call  PROF AC  (A,  N,  ICCL ,  IBL,  NBL ,  IW) 
where 

A  is  the  matrix  array  storage 
N  is  the  number  of  equations 

ICOL  (3*1-2)*  is  the  location  in  A  of  the  Ith 
diagonal  (pivot  position) 

ICOL  (3*1-1)*  is  the  first  (smallest)  row  number 
in  the  Ith  column 

ICOL  (3*1)*  is  the  last  (largest)  row  number  in  the 
Ith  column 

IBL  (4*J-3)  is  the  first  column  number  in  the  Jth 
block 

IBL  (4*J-2)  is  the  storage  relative  to  A(l),  of 
the  (1,1)  position  of  the  Jth  block 
IBL  (4*J-1)  is  the  number  of  rows  in  the  Jth  block 
(always  positive) 

IBL  (4*J)  is  the  storage  increment  between  the  ( 

and  the  (K,  L+l)  positions  of  the  block 
NBL  is  the  total  number  of  L  and  U  blocks 
IW  (I)  is  the  L  block  number  that  includes  the  Ith 
column 

2.  Solution  (Forward  and  Backward  substitution) 

CALL  PROSOL  (A,  N,  Y,  ICOL,  IBL,  MOV) 
where 

Y*contains  the  right  hand  side  on  entry  and  the 
solution  on  exit;  it  must  be  dimen¬ 
sioned  at  least  N  +  6N^  +  5N2 

MOV* is  5N1 

5N1  is  1  -  (most  negative  row  number  in  the  blocked 
U  matrix) 

6N2  is  (most  positive  row  number  in  the  blocked  L 
matrix)  -  N 

*  Input  data  to  subroutine 


3.  Blocking  algorithm 

CALL  BlOCK  (IBL,  ICOL,  IW,  N,  M,  NPB ,  MOV,  IDSTOR,  NBL) 

IBL,  ICOL*,  IW,  N*,  MOV,  NBL  are  defined  in  calls 
to  PROF AC  and  PROSOL 

M  is  the  maximum  half-bandwidth 

NPB*  is  the  number  of  unknowns  per  grid  point 
DSTOR  is  the  storage  used  by  the  matrix  A.  (M  and 
IDSTOR  are  used  by  FORM  to  formulate  example 
equations,  but  in  general  are  unnecessary 
to  communicate  with  PROFAC  and  PROSOL.) 


Input  data  to  subroutine 


-40- 


F.  Example  Problems 


1.  Without  Automatic  Blocking 

A  driver  program  for  PROFAC  and  PROSOL  is  given  in  Table 
8(a).  The  symbolic  arrays  ICOL  and  IBL  need  be  formed  only  once; 
the  array  IW  and  the  displacement  MOV  may  be  formed  from  these 
two  arrays  as  shown  in  the  program. 

Input  data  for  the  profile  matrix  associated  with  the  D4 
ordering  of  an  8x12  grid  (see  Figure  5)  is  given  in  Table  8(b)-(c). 
The  output  array  Y(J)  has  the  solution  Y(J)  =  J  for  the  values 
given. 

2.  With  Automatic  Blocking 

A  driver  program  (included  on  the  tape  with  the  CAL-coded 
solver)  is  listed  in  Appendix  A.  For  symbolic  description ,  the 
user  may  input  either  (a)  ICOL  (3*J)  and  ICOL  (3*J-1)  or  (b)  the 
conventional  column-ordered  sparse  descriptors  of  L  and  U, 
assuming  that  all  columns  are  dense  from  the  first  to  the  last 
row  number.  Numeric  storage  for  the  matrix  is  formed  in  packed 
column-ordered  format;  storage  for  L  and  U  is  in  conventional 
column-ordered  banded  format,  using  the  maximum  half-bandwidth. 
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PROFILE  SOLVER  DRIVER 
DIMENSION  IBL(200) ,ICOL(300) ,IW( 100) 

DIMENSION  A( 1000) , B( 2000) ,1(100) 

EQUIVALENCE  (A< 1 ) ,B( 1000)) 

C*»«  READ  SYMBOLIC  DATA 
READ(5,1 )N,NBL,NA 
N3=3*N 

READ ( 5 , 1 ) ( ICOL( J ) , Jsl , N3) 

NBL4=4»NBL 

C«»«  THE  LU  FACTORS  ARE  STORED  BEGINNING  IN  NEGATIVE  STORAGE 
C  LOCATIONS  OF  MATRIX  A  STORAGE;  THESE  ARE  INDICATED  BY 

C  NEGATIVE  NUMBERS  IN  IBL(4*J-3) 

READ ( 5 1 1 )  ( IBL( J  ) ,  J  =  1 , NBL4) 

1  FORMAT( 1 015 ) 

NB=1 

DO  3  J  =  1  ,N 

IF(IBL(4*NB-3).EQ.J)NB=NB+1 

3  IW(J)sNB-1 
MOVsO 

DO  4  J  =  1 ,NBL 

4  MOV:MAXO( MOV , IABSC IBL( 4* J  —  1 ) ) ) 

C»»«  READ  NUMERIC  DATA 

READC  5,2)(A(J),J=1,NA) 

READ(5,2)(Y( J) ,J=1 ,N) 

2  FORHATC 10F5.0) 

CALL  PROFACt  A , N , ICOL , IBL, NBL , IW) 

CALL  PROSOLC  A , N , Y , ICOL, IBL, MOV ) 

WRITE ( 6,5)(Y(J),J=1,N) 

5  FORMAT  ( 5E1 2 . 4 ) 

STOP 

END 


(a)  Driver  program 


48  16  732 


1 

1 

5 

7 

1 

6 

14 

1 

9 

24 

1 

10 

35 

1 

1 1 

46 

2 

12 

57 

3 

15 

71 

3 

16 

86 

3 

17 

101 

4 

18 

116 

5 

19 

131 

6 

20 

146 

7 

23 

164 

7 

24 

183 

7 

25 

202 

8 

26 

221 

9 

27 

240 

10 

28 

259 

11 

28 

277 

12 

28 

294 

13 

31 

314 

13 

32 

335 

13 

33 

356 

14 

34 

377 

15 

35 

398 

16 

36 

419 

17 

36 

439 

1  8 

36 

456 

21 

37 

474 

21 

38 

493 

21 

39 

512 

22 

40 

531 

23 

41 

550 

24 

42 

569 

25 

42 

5  87 

26 

42 

602 

29 

43 

617 

30 

44 

632 

31 

45 

647 

32 

46 

662 

33 

46 

676 

34 

46 

6  87 

37 

47 

698 

38 

48 

709 

39 

48 

719 

40 

48 

726 

43 

48 

732 

44 

48 

1 

-306 

4 

21 

3 

-264 

6 

21 

7 

-1  80 

8 

21 

13 

-54 

10 

21 

28 

261 

8 

21 

36 

429 

6 

21 

4  2 

555 

4 

21 

48 

681 

0 

21 

48 

680 

5 

-21 

46 

638 

7 

-21 

42 

554 

9 

-21 

36 

428 

1 1 

-21 

21 

113 

9 

-21 

13 

-55 

7 

-21 

7 

-1  81 

5 

-21 

1 

-307 

1 

-21 

(b)  Symbolic  input  data 
Table  8.  Sample  driver  program  and  input  data 
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-1. 

-1 . 

20. 

-1 . 

-1. 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1 . 

-1  . 

-1 . 

-1  . 

-1 . 

-1. 

-1. 

-1 . 

-1 . 

-1 . 

-1 . 

20. 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1. 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1 . 

-1. 

-1 . 

-1 . 

20. 

-1 . 

-1  . 

-1 . 

-1. 

-1 . 

-1 . 

-1 . 

-1 . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1 . 

-1  . 

-1 . 

-1 . 

20. 

I 

-1  . 
-1  . 

-1  . 
-1  . 

-1  . 

-1  . 
-1  . 

-1  . 

-1  . 

-1  . 
-1  . 

-1 . 
-1  . 

-1  . 

-1 . 

20. 

-1  . 

-1  . 

-1  . 

-1  . 

-1 . 

-1  . 

-1. 

-1 . 

-1 . 

-1  . 

-1  . 

-1  . 

-1  . 

-1 . 

-1  . 

-1 . 

-1. 

20. 

-1. 

-1  . 

-1 . 

-1. 

-1. 

-1 . 

-1  . 

-1 . 

-1 . 

-1 . 

-1. 

-1  . 

-1. 

-1 . 

-1 . 

-1  . 

20. 

-1  . 

-1 . 

-1. 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1. 

-1 . 

-1. 

-1. 

20. 

-1. 

-1 . 

-1. 

-1. 

-1. 

-1. 

-1. 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1 . 

-1  . 

-1. 

-1. 

-1 . 

-1 . 

-1. 

20. 

-1 . 

-1. 

-1. 

-1. 

-1 . 

-1 . 

-1 . 

-1  . 

-1 . 

-1  . 

-1  . 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1  . 

-1 . 

20. 

-1  . 

-1 . 

-1. 

-1 . 

-1  . 

-1  . 

-1. 

-1 . 

-1. 

-1. 

-1. 

-1 . 

-1  . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

20. 

-1 . 

-1. 

-1. 

-1. 

-1. 

-1. 

-1. 

-1 . 

-t. 

-1 . 

-1. 

-1. 

-1 . 

-1 . 

-1. 

-1  . 

-1  . 

-1  . 

-1 . 

-1 . 

20. 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1  . 

-1 . 

-1  . 

-1 . 

-1  . 

-1  . 

-1  . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1 . 

-1 . 

20. 

-1 . 

-1 . 

-1 . 

-1. 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1. 

-1  . 

-1. 

-1 . 

-1 . 

20. 

-1. 

-1  . 

-1. 

-1 . 

-1 . 

-1  . 

-1. 

-1. 

-1 . 

-1. 

-1 . 

-1 . 

-1. 

-1 . 

-1 . 

-1 . 

20. 

-1 . 

-1. 

-1 . 

-1  . 

_ 

-1 . 

-1 . 

-1 . 

-1 . 

-1 

-1. 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

20. 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1. 

-1 . 

-1. 

-1. 

-1. 

-1. 

-1. 

-1. 

-1. 

-1. 

-1 . 

20. 

-1  . 

-1 . 

-1. 

-1 . 

-1. 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1  . 

-1  . 

-1. 

-1 . 

-1 . 

-1 . 

-1. 

-1  . 

20. 

-1  . 

-1  . 

-1 . 

-1 . 

-1. 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1  . 

-1  . 

-1 . 

-1. 

-1 . 

-1 . 

-1  . 

-1  . 

20. 

-1 . 

-1  . 

-1. 

-1 . 

-1. 

-1. 

-1. 

-1. 

-1  . 

-1  , 

-1 . 

-1 . 

-1  . 

-1  . 

-1. 

-1 . 

-1 . 

-1. 

20. 

-1  . 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1. 

-1. 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1  . 

-1 . 

-1. 

-1. 

-1 . 

20. 

-1 . 

-1  . 

-1 . 

-1 . 

-1  . 

-1 . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1. 

-1 . 

-1  . 

-1. 

-1. 

-1. 

20. 

-1 . 

-1 . 

-1. 

-1  . 

-1  . 

-1  . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1  . 

-1  . 

20. 

-1 . 

-1  . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

-1 . 

-1  . 

-1  . 

-1 . 

-1  . 

-1 . 

-1 . 

20. 

-1. 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1. 

-1 . 

-1 . 

-1 . 

-1  . 

20. 

-1 . 

-1  . 

-1 . 

-1 . 

-1. 

-1 . 

-1 . 

-1  . 

-1  . 

-1 . 

-1  . 

-1  . 

-1  . 

-1. 

20. 

-1  . 

-1  . 

-1 . 

-1  . 

-1 . 

-1 . 

-1  . 

-1  . 

-1 . 

-1 . 

-1 . 

-1  . 

-1  . 

-1  . 

20. 

-1  . 

-1 . 

-1  . 

-1 . 

-1  . 

-1  . 

-1 . 

-1 . 

-1  . 

-1 . 

-1 . 

-1 . 

-1 . 

20. 

-1 . 

-1  . 

-1 . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

20. 

•1  . 

-1  . 

-1  . 

-1  . 

-1 . 

-1 . 

-1  . 

-1  . 

-1 . 

-i . 

20. 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1 . 

20. 

-1 . 

-1  . 

-1 . 

-1  . 

-1  . 

-1  . 

-1 . 

-i . 

-1  . 

20. 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

-1  . 

20. 

-i . 

-1  . 

-1  . 

-1  . 

-1  . 

20. 

6. 

21  . 

18. 

29. 

39. 

49. 

30. 

35. 

39. 

45. 

51  . 

57. 

18. 

15. 

1 1  . 

13. 

15. 

17. 

48. 

80. 

23. 

12. 

-0. 

-0. 

-0. 

-0. 

37. 

75. 

116. 

99. 

31. 

83. 

85. 

87. 

132. 

178. 

237. 

243. 

249. 

255. 

i 

308. 

362. 

441  . 

**5 1 . 

510. 

570. 

714. 

77  8. 

(b) 

Numeric  input  data 

* 

Table 

8. 

Continued 

| 

-43 

- 

* 

L--  - 

J 

References 


[1]  Jordan,  T . ,  and  K.  Fong,  "Some  Linear  Algebraic  Algorithms  and 
and  Their  Performance  on  the  CRAY-1,"  Report  LA-6774,  Los  Alamos 
National  Laboratory,  June,  1977. 

[2]  Calahan,  D.  A.,  "A  Block -Oriented  Equation  Solver  for  the  CRAY-1," 
Report  #136,  Systems  Engineering  Laboratory,  University  of  Michi¬ 
gan,  Ann  Arbor,  December  1,  1980. 

[3]  Orbits,  D.  A.,  "A  CRAY-1  Simulator,"  Report  #118,  Systems  Engi¬ 
neering  Laboratory,  University  of  Michigan,  Ann  Arbor,  September, 
1978. 

[4]  Duff,  I.  S. ,  and  J.  K.  Reid,  "Experience  of  Sparse  Matrix  Codes 
on  the  CRAY-1,"  Report  CSS  116,  AERE  Harwell,  October,  1981. 

[5]  George,  J.  A.,  "Nested  Dissection  of  a  Regular  Finite  Element  Mesh," 
Siam  Jour.  Num.  Anal.,  vol.  10,  1973,  pp.  345-363. 

[6]  Calahan,  D.  A.,  and  W.  G.  Ames,  "Vector  Processors:  Models  and 
Applications,"  Trans.  IEEE,  vol.  CAS-26,  no.  9,  September,  1979, 
pp.  715-726. 

[7]  Calahan,  D.A.,  "Performance  of  Linear  Algebra  Codes  on  the  CRAY-1," 
SPE  Journal,  October,  1981,  pp.  558-564. 

[8]  Gustavs on,  F.  G. ,  "Some  Basic  Techniques  for  Solving  Sparse  Systems 
of  Linear  Equations,"  in  Sparse  Matrices  and  Their  Applications , 

Ed.  by  Rose  and  Willoughby,  Plenum  Press,  1972,  pp.  41-52. 

[9]  Calahan,  D.  A.,  P.  G.  Buning,  and  W.  N.  Joy,  "Vectorized  General 
Sparsity  Algorithms  with  Backing  Store,"  Report  #96,  Systems 
Engineering  Laboratory,  University  of  Michigan,  January  1977. 

[10]  Calahan,  D.  A.,  "Sparse  Vectorized  Direct  Solution  of  Elliptic 
Problems,"  in  Elliptic  Problem  Solvers ,  Ed.  by  M.  H.  Schultz, 

Academic  Press,  1981,  pp"!  24 ]  -245 . 

[11]  Price,  H.S.  and  K.  H.  Coates,  "Direct  Methods  in  Reservoir  Simu¬ 
lation,"  SPE  Journal,  vol.  14,  1974,  pp.  295.308. 

[12]  Woo,  P.  T. ,  and  J.  M.  Levesque,  "Benchmarking  a  Sparse  Elimination 
Routine  on  the  CYBER  205  and  the  CRAY-1,"  6th  SPE  Symposium  on 
Reservoir  Simulation,  New  Orleans,  Feb.  1-3,  1982,  pp.  535-538. 

[13]  Calahan,  D.A.,  W.  G.  Ames,  and  E.  J.  Sesek,  "A  Collection  of 
Equation-Solving  Codes  for  the  CRAY-1,"  Systems 'engineering  Lab¬ 
oratory,  University  of  Michigan,  August  1,  1979,  '.rev.  11/71). 

[14]  Woo,  P.  T. ,  S.  C.  Eisenstat,  M.  H.  Schultz,  and  A.  H.  Sherman, 
"Application  of  Sparse  Matrix  Techniques  to  Reservoir  Simulation," 

in  Sparse  Matrix  Computations,  Ed.  by  Bunch  and  Rose,  Academic  Press, 
1976,  pp.  527-438. 


<PAGE  1>  /////  FILE: PROF  I LE . DR  lllll  »>>>  MAIN  PROGRAM  <<<<<  <PAGE 


<PAGE  5>  /////  FILE :PROFILE  DR  /////  >>>>>  SUBROUTINE  RFADIIJ  "<<<  'PAGE 


(nTrintot^coaiO-"<N 


I 


z  o 

UJ  h- 

Z 

2  in 

z 

lU 

< 

-»  O 

uj  *-<  2 

a  a 

X 

j  uj  O 

04 

■*» 

<  2  u- 

a 

Z  3 

h- 

i 

O  Z  uj 

< 

O  z 

2 

CO 

<  X  ii 

a 

►*«  *■“«  H- 

u. 

z 

act  o 

O 

_  *-  o 

I  <  a 

z 

+ 

*-  2  00 

o 

-3  3 

N 

*- 

uj  in 

»- 

u.  2 

a.  i 

O  3  Z  ~ 

in 

-1  o 

(/>*-•  a 

O 

►—  -j 

o 

a. 

+  l/lw 

CM  <  Z  H 

a 

1  UJ  to 

k 

”3  ••  >  O 

00 

■*- 

*  <1  *■«  ►H 

< 

(J  ii  00 

n  t-  O 

-j 

<-*  a 

~  < 

^  z 

CJ  Q  l/l  C0 

u. 

1  CM 

•-  <  a 

O 

* 

>-  z 

3  i 

Z  3  * 

Z 

w  » 

o  o.  z  - 

o 

~  3 

m.  z  2  Z 

* 

3 

K 

i 

<  -J  - 

< 

00  * 

O  Z  C  -l 

u 

CL  CM 

o  *■*  o  o 

o 

Z  3  o 

u 

-J 

»** 

z  >  ~ 

■f  *  w 

o  UJ  CO  w 

UJ 

i 

CO 

CO 

—  >  > 

~  o 

a 

oc  a  + 

a.  •-  Q  — 

—  < 

3 

3  Z  O  Z 

uj  o  uj  a 

a 

* 

* 

h-  *  a 

2  X  O 

-»  o 

Z  n 

n 

• 

i/>  r>  o 

3  h-  u  O 

o  ►- 

w 

•w 

«- 

Q  K- 

Z  O  <  -J 

u  in 

-  -J 

— 1 

*-*  ♦  oo 

Z  a 

** 

-  o 

o 

ti 

o 

UJ  UJ 

to  O 

u 

o 

II  -J  ~ 

Zu.  uZ 

Z  - 

II  *-« 

-j 

4 

—  03  —  O  n 
2  i-  •-«  tt  3  m  h 


a  -  O  3 
uj  <  ►-  O 

»-  a 

ui  2  (/>  CO 
O  ~  3 

oo 


ifloa 

Z  t“  O  O  r  (N 

UJ  l/l  K  U  u 

S  Q  1/1  *— 

~  «  o  o 

O  —  Q 


o  n  n 

o  o  ~ 

-  k  -j  a  Z 

tn  o  o  a 

O  O  O  »-  3 

O  ~  t/>  »-  o 

Q  UJ  Z 

•—  a  uj 


r**ooO>0^<Nn^Tto<i)r-{Dmo*"CN 

r*p^!^oococoeooooocococo.:o)0)0) 


